Semantic IDs
A Related Work
Semantic IDs are created using an auto-encoder (RQ-VAE [40, 21]) for retrieval models. We refer to Vector Quantization as the process of converting a high-dimensional vector into a low-dimensional tuple of codewords; we discuss this technique in more detail in Subsection 3.1. We use users' review history; during training, we limit the number of items in a user's history to 20. The results for this dataset are reported in Table 7 as the row 'P5'.
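The residual quantization behind RQ-VAE can be sketched as follows. This is a minimal illustration, assuming toy dimensions and random codebooks in place of the learned ones: each level quantizes the residual left over by the previous level, so a vector becomes a short tuple of codeword indices.

```python
import numpy as np

# Toy setup: 3 quantization levels, each with its own 16-entry codebook.
# All shapes and values are illustrative, not from any trained model.
rng = np.random.default_rng(0)
DIM, LEVELS, CODEBOOK_SIZE = 8, 3, 16
codebooks = rng.normal(size=(LEVELS, CODEBOOK_SIZE, DIM))

def quantize(x):
    """Map a high-dimensional vector to a low-dimensional tuple of codeword indices."""
    residual, codes = x, []
    for level in range(LEVELS):
        # pick the nearest codeword at this level...
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        # ...and pass what it failed to explain down to the next level
        residual = residual - codebooks[level][idx]
    return tuple(codes)

x = rng.normal(size=DIM)
semantic_id = quantize(x)
```

In the full RQ-VAE the codebooks are trained jointly with an encoder and decoder; the nested lookup above only shows why the output is a hierarchical tuple rather than a single code.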
Recommender Systems with Generative Retrieval
Modern recommender systems perform large-scale retrieval by embedding queries and item candidates in the same unified space, followed by approximate nearest neighbor search to select top candidates given a query embedding. In this paper, we propose a novel generative retrieval approach, where the retrieval model autoregressively decodes the identifiers of the target candidates. To that end, we create a semantically meaningful tuple of codewords to serve as a Semantic ID for each item. Given Semantic IDs for items in a user session, a Transformer-based sequence-to-sequence model is trained to predict the Semantic ID of the next item that the user will interact with. We show that recommender systems trained with the proposed paradigm significantly outperform the current SOTA models on various datasets. In addition, we show that incorporating Semantic IDs into the sequence-to-sequence model enhances its ability to generalize, as evidenced by the improved retrieval performance observed for items with no prior interaction history.
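The decoding step described above can be sketched as constrained generation: the model emits one codeword at a time, restricted to prefixes that lead to a real item. In the sketch below, a toy scorer stands in for the trained sequence-to-sequence model, and the three-item catalog is hypothetical.

```python
# Hypothetical catalog: each item's Semantic ID is a 3-codeword tuple.
ITEM_IDS = [(1, 4, 2), (1, 4, 7), (3, 0, 5)]

def valid_next(prefix):
    """Codewords that extend `prefix` toward at least one real item."""
    n = len(prefix)
    return sorted({sid[n] for sid in ITEM_IDS if sid[:n] == prefix})

def score(prefix, code):
    # Stand-in for the Transformer's next-token logits: prefer small codewords.
    return -code

def decode(length=3):
    """Greedy autoregressive decode, constrained to valid Semantic IDs."""
    prefix = ()
    for _ in range(length):
        candidates = valid_next(prefix)
        prefix += (max(candidates, key=lambda c: score(prefix, c)),)
    return prefix

predicted = decode()  # always the Semantic ID of a real catalog item
```

Restricting each step to valid continuations (in practice via a prefix trie over the catalog) is what lets autoregressive decoding double as retrieval: the generated tuple is guaranteed to identify an existing item.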
Multi-Aspect Cross-modal Quantization for Generative Recommendation
Zhang, Fuwei, Liu, Xiaoyu, Xi, Dongbo, Yin, Jishen, Chen, Huan, Yan, Peng, Zhuang, Fuzhen, Zhang, Zhao
Generative Recommendation (GR) has emerged as a new paradigm in recommender systems. This approach relies on quantized representations to discretize item features, modeling users' historical interactions as sequences of discrete tokens. Based on these tokenized sequences, GR predicts the next item by employing next-token prediction methods. The challenges of GR lie in constructing high-quality semantic identifiers (IDs) that are hierarchically organized, minimally conflicting, and conducive to effective generative model training. However, current approaches remain limited in their ability to harness multimodal information and to capture the deep and intricate interactions among diverse modalities, both of which are essential for learning high-quality semantic IDs and for effectively training GR models. To address this, we propose Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec), which introduces multi-modal information and incorporates it into both semantic ID learning and generative model training from different aspects. Specifically, we first introduce cross-modal quantization during the ID learning process, which effectively reduces conflict rates and thus improves codebook usability through the complementary integration of multimodal information. In addition, to further enhance the generative ability of our GR model, we incorporate multi-aspect cross-modal alignments, including the implicit and explicit alignments. Finally, we conduct extensive experiments on three well-known recommendation datasets to demonstrate the effectiveness of our proposed method.
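The cross-modal quantization idea can be illustrated with a minimal sketch: fuse text and image embeddings before the nearest-codeword lookup, so the resulting code reflects both modalities. Mean fusion, the shapes, and the random codebook are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Toy shared codebook: 8 codewords in a 4-d space (illustrative only).
rng = np.random.default_rng(1)
codebook = rng.normal(size=(8, 4))

def cross_modal_code(text_emb, image_emb):
    """Quantize a fused text+image embedding to one codeword index."""
    fused = (text_emb + image_emb) / 2.0  # simplest possible fusion
    return int(np.argmin(np.linalg.norm(codebook - fused, axis=1)))

code = cross_modal_code(rng.normal(size=4), rng.normal(size=4))
```

Because two items with similar text but different images fuse to different vectors, they are less likely to collide on the same code, which is the conflict-rate reduction the abstract points to.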
NEZHA: A Zero-sacrifice and Hyperspeed Decoding Architecture for Generative Recommendations
Wang, Yejing, Zhou, Shengyu, Lu, Jinyu, Liu, Ziwei, Liu, Langming, Wang, Maolin, Zhang, Wenlin, Li, Feng, Su, Wenbo, Wang, Pengjie, Xu, Jian, Zhao, Xiangyu
Generative Recommendation (GR), powered by Large Language Models (LLMs), represents a promising new paradigm for industrial recommender systems. However, their practical application is severely hindered by high inference latency, which makes them infeasible for high-throughput, real-time services and limits their overall business impact. While Speculative Decoding (SD) has been proposed to accelerate the autoregressive generation process, existing implementations introduce new bottlenecks: they typically require separate draft models and model-based verifiers, requiring additional training and increasing the latency overhead. In this paper, we address these challenges with NEZHA, a novel architecture that achieves hyperspeed decoding for GR systems without sacrificing recommendation quality. Specifically, NEZHA integrates a nimble autoregressive draft head directly into the primary model, enabling efficient self-drafting. This design, combined with a specialized input prompt structure, preserves the integrity of sequence-to-sequence generation. Furthermore, to tackle the critical problem of hallucination, a major source of performance degradation, we introduce an efficient, model-free verifier based on a hash set. We demonstrate the effectiveness of NEZHA through extensive experiments on public datasets and have successfully deployed the system on Taobao since October 2025, driving billion-scale advertising revenue and serving hundreds of millions of daily active users.
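The model-free verification idea reduces to a set-membership check: a drafted Semantic ID is accepted only if it exists in a hash set of real catalog items, filtering hallucinated IDs without a second model. The catalog and drafts below are toy stand-ins.

```python
# Hypothetical hash set of all valid Semantic IDs in the catalog.
VALID_IDS = {(1, 4, 2), (3, 0, 5)}

def verify(drafts):
    """Keep only drafted IDs that correspond to real items (O(1) per lookup)."""
    return [d for d in drafts if d in VALID_IDS]

drafts = [(1, 4, 2), (9, 9, 9), (3, 0, 5)]  # (9, 9, 9) is hallucinated
accepted = verify(drafts)
```

Compared with a learned verifier, a hash set adds no training cost and negligible latency, which is why it suits the high-throughput setting the abstract targets.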
LLaDA-Rec: Discrete Diffusion for Parallel Semantic ID Generation in Generative Recommendation
Shi, Teng, Shen, Chenglei, Yu, Weijie, Nie, Shen, Li, Chongxuan, Zhang, Xiao, He, Ming, Han, Yan, Xu, Jun
Generative recommendation represents each item as a semantic ID, i.e., a sequence of discrete tokens, and generates the next item through autoregressive decoding. While effective, existing autoregressive models face two intrinsic limitations: (1) unidirectional constraints, where causal attention restricts each token to attend only to its predecessors, hindering global semantic modeling; and (2) error accumulation, where the fixed left-to-right generation order causes prediction errors in early tokens to propagate to the predictions of subsequent tokens. To address these issues, we propose LLaDA-Rec, a discrete diffusion framework that reformulates recommendation as parallel semantic ID generation. By combining bidirectional attention with an adaptive generation order, the approach models inter-item and intra-item dependencies more effectively and alleviates error accumulation. Specifically, our approach comprises three key designs: (1) a parallel tokenization scheme that produces semantic IDs for bidirectional modeling, addressing the mismatch between residual quantization and bidirectional architectures; (2) two masking mechanisms at the user-history and next-item levels to capture both inter-item sequential dependencies and intra-item semantic relationships; and (3) an adapted beam search strategy for adaptive-order discrete diffusion decoding, resolving the incompatibility of standard beam search with diffusion-based generation. Experiments on three real-world datasets show that LLaDA-Rec consistently outperforms both ID-based and state-of-the-art generative recommenders, establishing discrete diffusion as a new paradigm for generative recommendation.
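The adaptive-order decoding contrast with left-to-right generation can be sketched as iterative unmasking: every position starts masked, and each step fills whichever position the model is most confident about. The fixed confidence table below is a toy stand-in for the bidirectional model's predictions.

```python
MASK = None  # masked-token placeholder

def toy_confidences(tokens):
    """Stand-in for the bidirectional model: (token, confidence) per masked slot."""
    table = {0: (7, 0.6), 1: (2, 0.9), 2: (5, 0.8)}
    return {p: table[p] for p, t in enumerate(tokens) if t is MASK}

def parallel_decode(length=3):
    """Fill positions in order of model confidence, not left to right."""
    tokens = [MASK] * length
    while MASK in tokens:
        preds = toy_confidences(tokens)
        pos = max(preds, key=lambda p: preds[p][1])  # most confident slot first
        tokens[pos] = preds[pos][0]
    return tokens
```

Here the middle and last tokens are committed before the first, so an uncertain early token cannot force errors onto the rest of the ID, which is the error-accumulation argument in miniature.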
MMQ: Multimodal Mixture-of-Quantization Tokenization for Semantic ID Generation and User Behavioral Adaptation
Xu, Yi, Zhang, Moyu, Li, Chenxuan, Liao, Zhihao, Xing, Haibo, Deng, Hao, Hu, Jinxin, Zhang, Yu, Zeng, Xiaoyi, Zhang, Jing
Recommender systems traditionally represent items using unique identifiers (ItemIDs), but this approach struggles with large, dynamic item corpora and sparse long-tail data, limiting scalability and generalization. Semantic IDs, derived from multimodal content such as text and images, offer a promising alternative by mapping items into a shared semantic space, enabling knowledge transfer and improving recommendations for new or rare items. However, existing methods face two key challenges: (1) balancing cross-modal synergy with modality-specific uniqueness, and (2) bridging the semantic-behavioral gap, where semantic representations may misalign with actual user preferences. To address these challenges, we propose Multimodal Mixture-of-Quantization (MMQ), a two-stage framework that trains a novel multimodal tokenizer. First, a shared-specific tokenizer leverages a multi-expert architecture with modality-specific and modality-shared experts, using orthogonal regularization to capture comprehensive multimodal information. Second, behavior-aware fine-tuning dynamically adapts semantic IDs to downstream recommendation objectives while preserving modality information through a multimodal reconstruction loss. Extensive offline experiments and online A/B tests demonstrate that MMQ effectively unifies multimodal synergy, specificity, and behavioral adaptation, providing a scalable and versatile solution for both generative retrieval and discriminative ranking tasks.
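The orthogonal regularization between shared and specific experts can be illustrated with a minimal penalty term: punish cross-correlation between the two representations so each carries distinct information. The function name and shapes are illustrative assumptions, not MMQ's exact loss.

```python
import numpy as np

def orthogonal_penalty(shared, specific):
    """Squared Frobenius norm of the batch cross-correlation of two representations."""
    cross = shared.T @ specific  # (d, d) correlation between the two spaces
    return float(np.sum(cross ** 2))

# Orthogonal representations incur zero penalty; overlapping ones do not.
u = np.array([[1.0], [0.0]])
v = np.array([[0.0], [1.0]])
```

Adding such a term to the training objective pushes the modality-shared experts to capture common structure while the modality-specific experts retain what is unique to each modality.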
Pctx: Tokenizing Personalized Context for Generative Recommendation
Zhong, Qiyong, Su, Jiajie, Ma, Yunshan, McAuley, Julian, Hou, Yupeng
Generative recommendation (GR) models tokenize each action into a few discrete tokens (called semantic IDs) and autoregressively generate the next tokens as predictions, showing advantages such as memory efficiency, scalability, and the potential to unify retrieval and ranking. Despite these benefits, existing tokenization methods are static and non-personalized. They typically derive semantic IDs solely from item features, assuming a universal item similarity that overlooks user-specific perspectives. However, under the autoregressive paradigm, semantic IDs with the same prefixes always receive similar probabilities, so a single fixed mapping implicitly enforces a universal item similarity standard across all users. In practice, the same item may be interpreted differently depending on user intentions and preferences. To address this issue, we propose a personalized context-aware tokenizer that incorporates a user's historical interactions when generating semantic IDs. This design allows the same item to be tokenized into different semantic IDs under different user contexts, enabling GR models to capture multiple interpretive standards and produce more personalized predictions. Experiments on three public datasets demonstrate up to 11.44% improvement in NDCG@10 over non-personalized action tokenization baselines. Our code is available at https://github.com/YoungZ365/Pctx.
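The personalized-tokenizer interface can be sketched with a toy mapping: the tokenizer conditions on a coarse summary of the user's history (here, the dominant category), so the same item receives different Semantic IDs under different user contexts. Every table and ID below is hypothetical, standing in for the learned tokenizer.

```python
from collections import Counter

# Toy mapping: item 42 has a different Semantic ID per interpretive lens.
TOKEN_TABLE = {
    (42, "games"): (1, 4, 2),     # item 42 seen through a gaming lens
    (42, "hardware"): (1, 9, 3),  # the same item, through a hardware lens
}

def dominant_interest(history):
    """Most frequent category in the user's interaction history."""
    return Counter(history).most_common(1)[0][0]

def tokenize(item_id, history):
    """Context-aware tokenization: the user's history selects the Semantic ID."""
    return TOKEN_TABLE[(item_id, dominant_interest(history))]

gamer_id = tokenize(42, ["games", "games", "hardware"])
builder_id = tokenize(42, ["hardware", "hardware", "games"])
```

Because the two IDs share the first codeword but diverge afterwards, an autoregressive GR model can assign the same item different probabilities for different users, escaping the single universal similarity standard the abstract criticizes.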
PLUM: Adapting Pre-trained Language Models for Industrial-scale Generative Recommendations
He, Ruining, Heldt, Lukasz, Hong, Lichan, Keshavan, Raghunandan, Mao, Shifan, Mehta, Nikhil, Su, Zhengyang, Tsai, Alicia, Wang, Yueqi, Wang, Shao-Chuan, Yi, Xinyang, Baugher, Lexi, Cakici, Baykal, Chi, Ed, Goodrow, Cristos, Han, Ningren, Ma, He, Rosales, Romer, Van Soest, Abby, Tandon, Devansh, Wu, Su-Lin, Yang, Weilong, Zheng, Yilin
Large Language Models (LLMs) pose a new paradigm of modeling and computation for information tasks. Recommendation systems are a critical application domain poised to benefit significantly from the sequence modeling capabilities and world knowledge inherent in these large models. In this paper, we introduce PLUM, a framework designed to adapt pre-trained LLMs for industry-scale recommendation tasks. PLUM consists of item tokenization using Semantic IDs, continued pre-training (CPT) on domain-specific data, and task-specific fine-tuning for recommendation objectives. For fine-tuning, we focus particularly on generative retrieval, where the model is directly trained to generate Semantic IDs of recommended items based on user context. We conduct comprehensive experiments on large-scale internal video recommendation datasets. Our results demonstrate that PLUM achieves substantial improvements for retrieval compared to a heavily-optimized production model built with large embedding tables. We also present a scaling study for the model's retrieval performance, our learnings about CPT, a few enhancements to Semantic IDs, along with an overview of the training and inference methods that enable launching this framework to billions of users in YouTube.
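One way to picture the item-tokenization step in such a pipeline is to extend the LM's vocabulary with one special token per (level, codeword) pair and serialize each item's Semantic ID into that token stream before continued pre-training. The token naming scheme below is a hypothetical illustration, not PLUM's actual format.

```python
def semantic_id_vocab(levels=3, codebook_size=4):
    """New special tokens to append to the LM vocabulary, one per (level, code)."""
    return [f"<sid_{l}_{c}>" for l in range(levels) for c in range(codebook_size)]

def render(semantic_id):
    """Serialize one item's Semantic ID into the LM's token stream."""
    return "".join(f"<sid_{l}_{c}>" for l, c in enumerate(semantic_id))

vocab = semantic_id_vocab()
example = render((2, 0, 3))
```

Level-tagged tokens keep the vocabulary small (levels × codebook size entries) while letting the model learn position-specific codeword statistics during continued pre-training.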